{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "51c3407b-6041-4217-9ef9-a0e619a51603",
   "metadata": {},
   "source": [
    "# Create custom multi-hop queries from your documents\n",
    "\n",
    "In this tutorial you will get to learn how to create custom multi-hop queries from your documents. This is a very powerful feature that allows you to create queries that are not possible with the standard query types. This also helps you to create queries that are more specific to your use case."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d0a971b",
   "metadata": {},
   "source": [
    "### Load sample documents\n",
    "I am using documents from [gitlab handbook](https://huggingface.co/datasets/explodinggradients/Sample_Docs_Markdown). You can download it by running the below command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd7e01c8",
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "! git clone https://huggingface.co/datasets/explodinggradients/Sample_Docs_Markdown"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "5e3647cd-f754-4f05-a5ea-488b6a6affaf",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import DirectoryLoader, TextLoader\n",
    "\n",
    "path = \"Sample_Docs_Markdown/\"\n",
    "loader = DirectoryLoader(path, glob=\"**/*.md\")\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7db0c75d",
   "metadata": {},
   "source": [
    "### Create KG\n",
    "\n",
    "Create a base knowledge graph with the documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "9034eaf0-e6d8-41d1-943b-594331972f69",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/homebrew/Caskroom/miniforge/base/envs/ragas/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    }
   ],
   "source": [
    "from ragas.testset.graph import KnowledgeGraph\n",
    "from ragas.testset.graph import Node, NodeType\n",
    "\n",
    "\n",
    "kg = KnowledgeGraph()\n",
    "for doc in docs:\n",
    "    kg.nodes.append(\n",
    "        Node(\n",
    "            type=NodeType.DOCUMENT,\n",
    "            properties={\n",
    "                \"page_content\": doc.page_content,\n",
    "                \"document_metadata\": doc.metadata,\n",
    "            },\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa9b3f77",
   "metadata": {},
   "source": [
    "### Set up the LLM and Embedding Model\n",
    "You may use any of [your choice](/docs/howtos/customizations/customize_models.md), here I am using models from open-ai."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "52f6d1ae-c9ed-4d82-99d7-d130a36e41e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from ragas.llms.base import llm_factory\n",
    "from ragas.embeddings.base import embedding_factory\n",
    "\n",
    "llm = llm_factory()\n",
    "embedding = embedding_factory()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f22a543f",
   "metadata": {},
   "source": [
    "### Setup Extractors and Relationship builders\n",
    "\n",
    "To create multi-hop queries you need to undestand the set of documents that can be used for it. Ragas uses relationships between documents/nodes to quality nodes for creating multi-hop queries. To concretize, if Node A and Node B and conencted by a relationship (say entity or keyphrase overlap) then you can create a multi-hop query between them.\n",
    "\n",
    "Here we are using 2 extractors and 2 relationship builders.\n",
    "- Headline extrator: Extracts headlines from the documents\n",
    "- Keyphrase extractor: Extracts keyphrases from the documents\n",
    "- Headline splitter: Splits the document into nodes based on headlines\n",
    "- OverlapScore Builder: Builds relationship between nodes based on keyphrase overlap"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "1308cf70-486c-4fc3-be9a-2401e9455312",
   "metadata": {},
   "outputs": [],
   "source": [
    "from ragas.testset.transforms import Parallel, apply_transforms\n",
    "from ragas.testset.transforms import (\n",
    "    HeadlinesExtractor,\n",
    "    HeadlineSplitter,\n",
    "    KeyphrasesExtractor,\n",
    "    OverlapScoreBuilder,\n",
    ")\n",
    "\n",
    "\n",
    "headline_extractor = HeadlinesExtractor(llm=llm)\n",
    "headline_splitter = HeadlineSplitter(min_tokens=300, max_tokens=1000)\n",
    "keyphrase_extractor = KeyphrasesExtractor(\n",
    "    llm=llm, property_name=\"keyphrases\", max_num=10\n",
    ")\n",
    "relation_builder = OverlapScoreBuilder(\n",
    "    property_name=\"keyphrases\",\n",
    "    new_property_name=\"overlap_score\",\n",
    "    threshold=0.01,\n",
    "    distance_threshold=0.9,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "7eb5f52e-4f9f-4333-bc71-ec795bf5dfff",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Applying KeyphrasesExtractor:   6%|██████▏                                                                                                         | 2/36 [00:01<00:17,  1.94it/s]Property 'keyphrases' already exists in node 'a2f389'. Skipping!\n",
      "Applying KeyphrasesExtractor:  17%|██████████████████▋                                                                                             | 6/36 [00:01<00:04,  6.37it/s]Property 'keyphrases' already exists in node '3068c0'. Skipping!\n",
      "Applying KeyphrasesExtractor:  53%|██████████████████████████████████████████████████████████▌                                                    | 19/36 [00:02<00:01,  8.88it/s]Property 'keyphrases' already exists in node '854bf7'. Skipping!\n",
      "Applying KeyphrasesExtractor:  78%|██████████████████████████████████████████████████████████████████████████████████████▎                        | 28/36 [00:03<00:00,  9.73it/s]Property 'keyphrases' already exists in node '2eeb07'. Skipping!\n",
      "Property 'keyphrases' already exists in node 'd68f83'. Skipping!\n",
      "Applying KeyphrasesExtractor:  83%|████████████████████████████████████████████████████████████████████████████████████████████▌                  | 30/36 [00:03<00:00,  9.35it/s]Property 'keyphrases' already exists in node '8fdbea'. Skipping!\n",
      "Applying KeyphrasesExtractor:  89%|██████████████████████████████████████████████████████████████████████████████████████████████████▋            | 32/36 [00:04<00:00,  7.76it/s]Property 'keyphrases' already exists in node 'ef6ae0'. Skipping!\n",
      "                                                                                                                                                                                  \r"
     ]
    }
   ],
   "source": [
    "transforms = [\n",
    "    headline_extractor,\n",
    "    headline_splitter,\n",
    "    keyphrase_extractor,\n",
    "    relation_builder,\n",
    "]\n",
    "\n",
    "apply_transforms(kg, transforms=transforms)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7da1d6d",
   "metadata": {},
   "source": [
    "### Configure personas\n",
    "\n",
    "You can also do this automatically by using the [automatic persona generator](/docs/howtos/customizations/testgenerator/_persona_generator.md)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "id": "213d93e7-1233-4df7-8022-4827b683f0b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from ragas.testset.persona import Persona\n",
    "\n",
    "person1 = Persona(\n",
    "    name=\"gitlab employee\",\n",
    "    role_description=\"A junior gitlab employee curious on workings on gitlab\",\n",
    ")\n",
    "persona2 = Persona(\n",
    "    name=\"Hiring manager at gitlab\",\n",
    "    role_description=\"A hiring manager at gitlab trying to underestand hiring policies in gitlab\",\n",
    ")\n",
    "persona_list = [person1, persona2]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ced43cb5",
   "metadata": {},
   "source": [
    "### Create multi-hop query \n",
    "\n",
    "Inherit from `MultiHopQuerySynthesizer` and modify the function that generates scenarios for query creation. \n",
    "\n",
    "**Steps**:\n",
    "- find qualified set of (nodeA, relationship, nodeB) based on the relationships between nodes\n",
    "- For each qualified set\n",
    "    - Match the keyphrase with one or more persona. \n",
    "    - Create all possible combinations of (Nodes, Persona, Query Style, Query Length)\n",
    "    - Samples the required number of queries from the combinations\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 137,
   "id": "08db4335-4b00-4f06-b855-4c847675a801",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataclasses import dataclass\n",
    "import typing as t\n",
    "from ragas.testset.synthesizers.multi_hop.base import (\n",
    "    MultiHopQuerySynthesizer,\n",
    "    MultiHopScenario,\n",
    ")\n",
    "from ragas.testset.synthesizers.prompts import (\n",
    "    ThemesPersonasInput,\n",
    "    ThemesPersonasMatchingPrompt,\n",
    ")\n",
    "\n",
    "\n",
    "@dataclass\n",
    "class MyMultiHopQuery(MultiHopQuerySynthesizer):\n",
    "\n",
    "    theme_persona_matching_prompt = ThemesPersonasMatchingPrompt()\n",
    "\n",
    "    async def _generate_scenarios(\n",
    "        self,\n",
    "        n: int,\n",
    "        knowledge_graph,\n",
    "        persona_list,\n",
    "        callbacks,\n",
    "    ) -> t.List[MultiHopScenario]:\n",
    "\n",
    "        # query and get (node_a, rel, node_b) to create multi-hop queries\n",
    "        results = kg.find_two_nodes_single_rel(\n",
    "            relationship_condition=lambda rel: (\n",
    "                True if rel.type == \"keyphrases_overlap\" else False\n",
    "            )\n",
    "        )\n",
    "\n",
    "        num_sample_per_triplet = max(1, n // len(results))\n",
    "\n",
    "        scenarios = []\n",
    "        for triplet in results:\n",
    "            if len(scenarios) < n:\n",
    "                node_a, node_b = triplet[0], triplet[-1]\n",
    "                overlapped_keywords = triplet[1].properties[\"overlapped_items\"]\n",
    "                if overlapped_keywords:\n",
    "\n",
    "                    # match the keyword with a persona for query creation\n",
    "                    themes = list(dict(overlapped_keywords).keys())\n",
    "                    prompt_input = ThemesPersonasInput(\n",
    "                        themes=themes, personas=persona_list\n",
    "                    )\n",
    "                    persona_concepts = (\n",
    "                        await self.theme_persona_matching_prompt.generate(\n",
    "                            data=prompt_input, llm=self.llm, callbacks=callbacks\n",
    "                        )\n",
    "                    )\n",
    "\n",
    "                    overlapped_keywords = [list(item) for item in overlapped_keywords]\n",
    "\n",
    "                    # prepare and sample possible combinations\n",
    "                    base_scenarios = self.prepare_combinations(\n",
    "                        [node_a, node_b],\n",
    "                        overlapped_keywords,\n",
    "                        personas=persona_list,\n",
    "                        persona_item_mapping=persona_concepts.mapping,\n",
    "                        property_name=\"keyphrases\",\n",
    "                    )\n",
    "\n",
    "                    # get number of required samples from this triplet\n",
    "                    base_scenarios = self.sample_diverse_combinations(\n",
    "                        base_scenarios, num_sample_per_triplet\n",
    "                    )\n",
    "\n",
    "                    scenarios.extend(base_scenarios)\n",
    "\n",
    "        return scenarios"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 138,
   "id": "6935cdde-99c0-4893-8bd1-f72dc398eaee",
   "metadata": {},
   "outputs": [],
   "source": [
    "query = MyMultiHopQuery(llm=llm)\n",
    "scenarios = await query.generate_scenarios(\n",
    "    n=10, knowledge_graph=kg, persona_list=persona_list\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 151,
   "id": "78fec1b9-f8a1-4237-9721-65bdae7059f8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultiHopScenario(\n",
       "nodes=2\n",
       "combinations=['Diversity Inclusion & Belonging', 'Diversity, Inclusion & Belonging Goals']\n",
       "style=Web search like queries\n",
       "length=short\n",
       "persona=name='Hiring manager at gitlab' role_description='A hiring manager at gitlab trying to underestand hiring policies in gitlab')"
      ]
     },
     "execution_count": 151,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scenarios[4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49a38d27",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "61ae1d99",
   "metadata": {},
   "source": [
    "### Run the multi-hop query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 143,
   "id": "da42bfb0-5122-4094-be22-6d6e74a9c0c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "result = await query.generate_sample(scenario=scenarios[-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 144,
   "id": "d4a865a7-b14b-4aa0-8def-128120cebae9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'How does GitLab ensure that its DIB roundtables are effective in promoting diversity and inclusion?'"
      ]
     },
     "execution_count": 144,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result.user_input"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b716f1a5",
   "metadata": {},
   "source": [
    "Yay! You have created a multi-hop query. Now you can create any such queries by creating and exploring relationships between documents."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5088c18-a8eb-4180-b066-46a8a795553b",
   "metadata": {},
   "source": [
    "## "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ragas",
   "language": "python",
   "name": "ragas"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.20"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
